import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
data = pd.read_excel('Credit Card Customer Data.xlsx')
data.head()
data.isnull().sum()
There are no null values in the data.
data.dtypes
data.shape
data.describe().T
Avg_Credit_Limit ranges from 3,000 to 200,000, whereas the other important features (Total_Credit_Cards, Total_visits_bank, Total_visits_online and Total_calls_made) range between roughly 0 and 15. Such widely differing scales can bias distance-based methods, so the data needs scaling before clustering.
data_drop = data.drop(['Sl_No', 'Customer Key'], axis=1)
data_drop.shape
for i in data_drop.columns:
    sns.distplot(data_drop[i], kde=True)  # note: distplot is deprecated in newer seaborn; histplot(..., kde=True) is the replacement
    plt.show()
The KDE plots show the modes of each feature. Counting the peaks gives a rough lower bound on the number of distinct clusters a particular feature might form.
Avg_Credit_Limit shows 3 peaks, Total_Credit_Cards shows 4, Total_visits_bank shows 3, Total_visits_online shows 3 and Total_calls_made shows 2, indicating the minimum number of probable separate clusters.
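Eyeballing peaks works, but the count can also be estimated programmatically. The sketch below is a rough heuristic, not part of the original analysis; the helper `count_kde_peaks`, the grid size and the prominence threshold are all assumptions, demonstrated here on synthetic bimodal data rather than the notebook's columns.

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.signal import find_peaks

def count_kde_peaks(values, grid_size=200, prominence=0.05):
    """Estimate the number of modes in a 1-D sample by counting KDE peaks."""
    values = np.asarray(values, dtype=float)
    grid = np.linspace(values.min(), values.max(), grid_size)
    density = gaussian_kde(values)(grid)
    # Normalise so the prominence threshold is scale-free.
    density = density / density.max()
    peaks, _ = find_peaks(density, prominence=prominence)
    return len(peaks)

# Example on synthetic data: two well-separated modes at 0 and 8.
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(0, 1, 500), rng.normal(8, 1, 500)])
print(count_kde_peaks(sample))  # -> 2
```

Applied to each column of `data_drop`, this would give a quick cross-check of the peak counts read off the plots.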
sns.pairplot(data_drop, diag_kind='kde');
Although one or two points sit slightly apart from the rest of the data (for example, a point with Avg_Credit_Limit of 50,000 and Total_calls_made of 9 looks like an outlier), the values are realistic, so no data will be removed.
from scipy.stats import zscore
dataScaled = data_drop.apply(zscore)
dataScaled.shape
dataScaled_2 = dataScaled.copy()
Making a clean copy of the scaled dataframe for later use.
from sklearn.cluster import KMeans
clusters=range(1,10)
SumOfSquaredDistances=[]
for k in clusters:
    model = KMeans(n_clusters=k)
    model_fit = model.fit(dataScaled)
    SumOfSquaredDistances.append(model_fit.inertia_)
plt.plot(clusters, SumOfSquaredDistances, 'rx-')
plt.xlabel('Number of clusters', fontsize=14)
plt.ylabel('Sum of squared distances', fontsize=14)
plt.title('Elbow Method', fontsize=18)
plt.show()
The elbow plot above shows the within-cluster sum of squares (inertia) for different numbers of clusters. Most of the inertia is eliminated by 3 clusters, after which it decreases only slowly, indicating that the optimal number of clusters for this data is 3.
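The "elbow" can also be read off numerically as the relative drop in inertia from k to k+1: the elbow sits where this drop flattens out. A minimal sketch on synthetic three-blob data (the dataset, k range and `n_init` are stand-ins, not the notebook's data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled data: three well-separated blobs.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

inertias = [KMeans(n_clusters=k, random_state=0, n_init=10).fit(X).inertia_
            for k in range(1, 8)]

# Relative drop in inertia from k to k+1; the elbow is where this flattens.
drops = [(a - b) / a for a, b in zip(inertias, inertias[1:])]
for k, d in enumerate(drops, start=1):
    print(f"k={k} -> k={k+1}: inertia falls by {d:.1%}")
```

On three well-separated blobs the drop from k=2 to k=3 is large, while the drop from k=3 to k=4 is small, mirroring the visual elbow.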
kmeans = KMeans(n_clusters=3, random_state=12)
kmeans_fit = kmeans.fit(dataScaled)
data['labels_kmeans'] = kmeans_fit.labels_
data_drop['labels_kmeans'] = kmeans_fit.labels_
dataScaled['labels_kmeans'] = kmeans_fit.labels_
data.head()
sns.pairplot(data_drop, hue='labels_kmeans', diag_kind='kde', palette='tab10');
dataScaled.boxplot(by='labels_kmeans', layout=(2,5), figsize=(11,10), grid=True);
Each cluster is distinct from the others. Looking at any single feature there is some overlap; for example, in Avg_Credit_Limit clusters 0 and 2 overlap. However, looking at two features at a time, the pairplots show clear separation; for example, plotting Avg_Credit_Limit against Total_visits_online clearly separates the green, blue and orange clusters.
#pip install plotly==4.14.3
Note: If plotly is not already installed, install it with the command above, then restart the kernel and re-run all the cells above to get the 3D plots.
import plotly.express as px
for i in ['Total_visits_bank', 'Total_visits_online', 'Total_calls_made']:
    fig = px.scatter_3d(dataScaled, x='Avg_Credit_Limit', y='Total_Credit_Cards', z=i,
                        color='labels_kmeans')
    fig.show()
The 3-dimensional views above clearly show the separation of clusters when the total number of clusters is 3. The dark pink points form a separate cluster far from the other two; the blue and yellow points lie close to each other with some overlap.
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import cophenet, dendrogram, linkage
from scipy.spatial.distance import pdist
Z_single = linkage(dataScaled_2, metric='euclidean', method='single')
c_single, coph_dists_single = cophenet(Z_single , pdist(dataScaled_2))
c_single
Z_ward = linkage(dataScaled_2, metric='euclidean', method='ward')
c_ward, coph_dists_ward = cophenet(Z_ward , pdist(dataScaled_2))
c_ward
Z_complete = linkage(dataScaled_2, metric='euclidean', method='complete')
c_complete, coph_dists_complete = cophenet(Z_complete , pdist(dataScaled_2))
c_complete
Z_weighted = linkage(dataScaled_2, metric='euclidean', method='weighted')
c_weighted, coph_dists_weighted = cophenet(Z_weighted , pdist(dataScaled_2))
c_weighted
Z_avg = linkage(dataScaled_2, metric='euclidean', method='average')
c_avg, coph_dists_avg = cophenet(Z_avg , pdist(dataScaled_2))
c_avg
titles = ['Z_avg', 'Z_complete', 'Z_single', 'Z_ward', 'Z_weighted']
for title, Z in zip(titles, [Z_avg, Z_complete, Z_single, Z_ward, Z_weighted]):
    plt.figure(figsize=(10, 5))
    plt.title(title)
    plt.xlabel('sample index')
    plt.ylabel('Distance')
    dendrogram(Z, leaf_rotation=45, leaf_font_size=12)
    plt.tight_layout()
Most of the dendrograms show 3 similar big clusters, except with single linkage. To choose the best linkage, the cophenetic correlations were calculated; average linkage gives the highest cophenetic correlation (~0.9), indicating it is the most faithful to the original pairwise distances.
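The five cophenetic correlations above can be computed in a single loop instead of five separate cells. A self-contained sketch, using synthetic three-cluster data as a stand-in for `dataScaled_2`:

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

# Synthetic stand-in for the scaled data used in the notebook.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc, 0.5, size=(50, 4)) for loc in (0, 4, 8)])

dists = pdist(X, metric='euclidean')
scores = {}
for method in ('single', 'complete', 'average', 'weighted', 'ward'):
    Z = linkage(X, metric='euclidean', method=method)
    # Cophenetic correlation: how faithfully the dendrogram preserves
    # the original pairwise distances (closer to 1 is better).
    scores[method], _ = cophenet(Z, dists)

best = max(scores, key=scores.get)
print({m: round(c, 3) for m, c in scores.items()}, '-> best:', best)
```

Replacing `X` with `dataScaled_2` reproduces the notebook's comparison in one pass and picks the best linkage automatically.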
hierarchy = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='average')  # note: scikit-learn >= 1.2 renames affinity to metric
hierarchy_fit = hierarchy.fit(dataScaled_2)
data['labels_hierarchy'] = hierarchy_fit.labels_
data_drop['labels_hierarchy'] = hierarchy_fit.labels_
dataScaled['labels_hierarchy'] = hierarchy_fit.labels_
data.head()
dataScaled.drop(['labels_kmeans'], axis=1).boxplot(by='labels_hierarchy', layout=(2,5), figsize=(11,10), grid=True);
The clusters formed by the hierarchical method are very similar to those from K-means clustering; the boxplots again show distinct separation with some overlap.
from sklearn.metrics import silhouette_score
print('The Silhouette Score for the hierarchical method is:', silhouette_score(dataScaled_2, hierarchy_fit.labels_))
print('The Silhouette Score for the K-means method is:', silhouette_score(dataScaled_2, kmeans_fit.labels_))
The average score is greater than 0.5, which indicates reasonably good clustering.
The scores for K-means and hierarchical clustering are almost the same, suggesting both algorithms work equally well for this data.
Using K-means as an example, let us see whether the silhouette score improves when the number of clusters is decreased or increased.
from sklearn.metrics import silhouette_samples
import matplotlib.cm as cm
X = dataScaled_2.to_numpy()
X.shape
range_n_clusters = [2, 3, 4, 5, 6]
for n_clusters in range_n_clusters:
    fig, (ax1) = plt.subplots(1, 1)
    fig.set_size_inches(5, 5)
    ax1.set_xlim([-0.1, 1])
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(X)
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)
    sample_silhouette_values = silhouette_samples(X, cluster_labels)
    y_lower = 10
    for i in range(n_clusters):
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()
        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i
        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
        y_lower = y_upper + 10
    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
    ax1.set_yticks([])
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')
    plt.show()
The plots above show that the silhouette score is highest when the number of clusters is 3. The score decreases for both fewer and more clusters, confirming that 3 is the optimal number. Some samples have negative silhouette scores when the number of clusters is 2, 5 or 6, signifying those points are assigned to the wrong cluster.
km = data_drop.drop(['labels_hierarchy'], axis=1).groupby(['labels_kmeans'])
km.mean()
hc = data_drop.drop(['labels_kmeans'], axis=1).groupby(['labels_hierarchy'])
hc.mean()
Looking at the boxplots and silhouette scores for both K-means and hierarchical clustering, we can conclude that both algorithms give very similar clustering output.
This is further verified by the per-cluster mean values of each variable in the two tables above: the means for clusters 0, 1 and 2 are almost identical between the two methods.
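Beyond comparing cluster means by eye, the agreement between the two label assignments can be quantified with the adjusted Rand index, which ignores the arbitrary numbering of cluster labels. A minimal sketch with toy label vectors (in practice the notebook's `kmeans_fit.labels_` and `hierarchy_fit.labels_` would be passed instead):

```python
from sklearn.metrics import adjusted_rand_score

# Toy stand-ins for the two label arrays: the same partition up to a
# relabelling, except the last point is assigned differently.
labels_km = [0, 0, 0, 1, 1, 1, 2, 2, 2]
labels_hc = [2, 2, 2, 0, 0, 0, 1, 1, 2]

# ARI = 1.0 means identical partitions regardless of label names;
# values near 1 mean the two algorithms agree almost everywhere.
print(adjusted_rand_score(labels_km, labels_hc))
```

A score close to 1 on the real label arrays would confirm numerically that K-means and hierarchical clustering found essentially the same segments.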
How many different segments of customers are there?
How are these segments different from each other?
What are your recommendations to the bank on how to better market to and service these customers?
There are mainly three separate segments of customers.
First are the customers with a high credit limit and many credit cards, who mostly interact with the bank online.
Second are the customers with a low credit limit and few credit cards, who make the most phone calls.
Finally, the third group has a credit limit and number of credit cards between those of the first two groups, and they mostly visit the bank in person.
To attract new customers, especially in the first group, online marketing campaigns (emails, contacting website visitors, LinkedIn posts, etc.) will help. These people likely already have high incomes and good credit, so the best way to attract them is to offer cards with higher credit limits. The second group of customers can be reached through direct phone calls. Finally, the third group can be captured by sending offers and information through regular mail.
For personalised targeting of existing customers, the bank can start by increasing credit limits or issuing new credit cards for groups 2 and 3. Based on this cluster analysis, the operations team can streamline service delivery according to the group each customer falls into. For example, group 1 customers mostly interact online, so offer them online chat/email support; group 2 customers mostly call (and some also go online), so offer them phone support; group 3 customers visit the bank in person, so they can be greeted face to face.
Gathering more information could make these clusters even better. For example, adding an 'Age group' field to the current data might help separate the clusters and explain customer behaviour: group 3 customers, who mostly visit the branch, may be older people less comfortable with online chat or phone support, which is why they visit in person.
Cluster analysis has helped us understand our customers better, and acting on these insights can definitely help increase the customer base.